deepspeed-chat: calculate loss in fp32 when using bf16 #754
tjruwase merged 1 commit into deepspeedai:master from
Conversation
@misland-habana, this is great! But I am curious, is there any reason that this cannot be enabled for fp16 as well (even fp8)? I think other dtypes could benefit from fp32 loss computation.
@tjruwase, I enabled Bloom with bf16 and did not test the impact of using fp32 loss for fp16. I only tested it for bf16. I would guess that fp16 would also benefit (maybe less than bf16, since fp16 has more mantissa bits). As for fp8, I agree that it may help.
@mosheisland, in that case do you mind changing the cli to something like
@tjruwase - sure, I will do it.
Using loss in fp32 can improve training accuracy for all 3 stages. This was tested with the Bloom model using the bf16 dtype.

While at it, fix stage2 reward model creation: pass zero_stage to create_critic_model.

Also, in stage3, when using bf16 with tensorboard enabled, we record the actor and critic loss. Tensorboard accepts a scalar bf16 loss tensor and converts it to numpy; this fails since numpy does not support bf16. Fix it by logging loss.item() to tensorboard instead.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
Signed-off-by: Moshe Island <misland@habana.ai>
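A minimal sketch of the idea, assuming a causal-LM style cross-entropy; the helper name, flag name, and shapes below are illustrative, not the actual DeepSpeed-Chat code:

```python
# Illustrative sketch only: upcast bf16 logits to fp32 before the
# cross-entropy so the log-softmax and reduction run in full precision.
import torch
import torch.nn.functional as F

def cross_entropy_loss(logits, labels, compute_fp32_loss=True):
    if compute_fp32_loss:
        logits = logits.float()  # bf16 -> fp32 before computing the loss
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

logits = torch.randn(4, 128, 32000, dtype=torch.bfloat16)  # [batch, seq, vocab]
labels = torch.randint(0, 32000, (4, 128))                  # int64 targets
loss = cross_entropy_loss(logits, labels)
print(loss.dtype)  # torch.float32
```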
The branch was force-pushed from 5dda2d4 to 044bd98.
@tjruwase, I have uploaded a new commit with --compute_fp32_loss. While testing it, I encountered an upstream issue that happens when you use bf16 with tensorboard.
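For reference, the failure is easy to reproduce outside of DeepSpeed-Chat; this is just a small illustration of why `.item()` sidesteps it:

```python
import torch

loss = torch.tensor(1.25, dtype=torch.bfloat16)

# TensorBoard's add_scalar converts tensors to numpy internally, and numpy
# has no bf16 dtype, so the direct conversion raises a TypeError.
try:
    loss.numpy()
except TypeError as e:
    print(e)  # e.g. "Got unsupported ScalarType BFloat16"

# .item() returns a plain Python float, which TensorBoard handles fine.
print(loss.item())
```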
Using loss in fp32 improved accuracy for bf16 training for all 3 stages. By default, all 3 stages will calculate loss in fp32 when using bf16. This can be disabled by using --no_bf16_to_fp32_loss.
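A minimal argparse sketch of such a default-on, opt-out flag; the flag name follows this description, though a comment above refers to the option as --compute_fp32_loss, so treat the exact spelling as illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
# Default is enabled; passing the flag turns the fp32 loss path off.
parser.add_argument("--no_bf16_to_fp32_loss",
                    dest="bf16_to_fp32_loss",
                    action="store_false",
                    help="Disable fp32 loss computation when training in bf16.")

args = parser.parse_args([])
assert args.bf16_to_fp32_loss            # enabled by default
args = parser.parse_args(["--no_bf16_to_fp32_loss"])
assert not args.bf16_to_fp32_loss        # explicitly disabled
```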
While at it, fix stage2 reward model creation: pass zero_stage to create_critic_model.
Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405